Computation and Language 34
☆ NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
Complex reasoning ability is one of the most important features of current
LLMs, which has also been leveraged to play an integral role in complex
decision-making tasks. Therefore, the investigation into the reasoning
capabilities of Large Language Models (LLMs) is critical: numerous benchmarks
have been established to assess the reasoning abilities of LLMs. However,
current benchmarks are inadequate in offering a rigorous evaluation of the full
extent of reasoning abilities that LLMs are capable of achieving. They are also
prone to the risk of overfitting, as these benchmarks, being publicly
accessible and static, allow models to potentially tailor their responses to
specific benchmark metrics, thereby inflating their performance. Addressing
these limitations, our research introduces a new benchmark, named NPHardEval.
This benchmark is designed to evaluate the reasoning abilities of LLMs across a
broad spectrum of 900 algorithmic questions, extending up to the NP-Hard
complexity class. These questions are meticulously chosen to represent a wide
range of complexity classes below the NP-hard complexity class, offering a
rigorous measure of the reasoning ability of LLMs. Through this study, we shed
light on the current state of reasoning in LLMs, providing an objective and
rigorous perspective through the comparison of LLMs' performance across complexity
classes. Moreover, this benchmark is designed with a dynamic update mechanism,
where the datapoints are refreshed on a monthly basis. Such regular updates
play a crucial role in mitigating the risk of LLMs overfitting to the
benchmark, promoting a more accurate and reliable assessment of their reasoning
capabilities. The benchmark dataset and code of NPHardEval are available at
https://github.com/casmlab/NPHardEval.
comment: 22 pages, 6 figures, 2 tables
☆ Robust Knowledge Extraction from Large Language Models using Social Choice Theory AAMAS 2024
Large-language models (LLMs) have the potential to support a wide range of
applications like conversational agents, creative writing, text improvement,
and general query answering. However, they are ill-suited for query answering
in high-stake domains like medicine because they generate answers at random and
their answers are typically not robust - even the same query can result in
different answers when prompted multiple times. In order to improve the
robustness of LLM queries, we propose issuing ranking queries repeatedly and
aggregating the query results using methods from social choice theory. We study ranking
queries in diagnostic settings like medical and fault diagnosis and discuss how
the Partial Borda Choice function from the literature can be applied to merge
multiple query results. We discuss some additional interesting properties in
our setting and evaluate the robustness of our approach empirically.
comment: Accepted by AAMAS 2024 as a full paper
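As a rough illustration of aggregating repeated ranking queries, the sketch below applies a plain Borda-style score to partial rankings. It is an assumption-laden toy: the scoring rule (k - i points for position i in a ranking of length k, 0 for omitted candidates) and the diagnosis names are hypothetical, and this is not necessarily the Partial Borda Choice function the abstract refers to.

```python
def borda_aggregate(rankings, candidates):
    """Aggregate repeated (possibly partial) rankings with a Borda-style score.

    Each ranking lists candidates from most to least plausible. A candidate at
    0-based position i in a ranking of length k earns k - i points; candidates
    that a ranking omits earn 0 points from it. (Illustrative rule only.)
    """
    scores = {c: 0 for c in candidates}
    for ranking in rankings:
        k = len(ranking)
        for i, cand in enumerate(ranking):
            scores[cand] += k - i
    # Sort by descending score; break ties alphabetically for determinism.
    order = sorted(candidates, key=lambda c: (-scores[c], c))
    return order, scores


# Hypothetical answers from prompting the same ranking query three times.
rankings = [
    ["flu", "cold", "allergy"],
    ["cold", "flu"],              # a partial ranking
    ["flu", "allergy", "cold"],
]
order, scores = borda_aggregate(rankings, ["flu", "cold", "allergy"])
print(order, scores)
```

Individual runs disagree, but the aggregated order is determined by the pooled scores, which is the robustness property the abstract is after.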
☆ Numerical Reasoning for Financial Reports
Financial reports offer critical insights into a company's operations, yet
their extensive length, typically spanning 30-40 pages, poses challenges for
swift decision making in dynamic markets. To address this, we leveraged
finetuned Large Language Models (LLMs) to distill key indicators and
operational metrics from these reports based on questions from the user. We
devised a method to locate critical data, and leveraged the FinQA dataset to
fine-tune both Llama-2 7B and T5 models for customized question answering. We
achieved results comparable to baseline on the final numerical answer, with
competitive accuracy in numerical reasoning and calculation.
comment: 10 pages, 11 figures, 6 tables
☆ VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
In the rapidly advancing field of conditional image generation research,
challenges such as limited explainability lie in effectively evaluating the
performance and capabilities of various models. This paper introduces VIESCORE,
a Visual Instruction-guided Explainable metric for evaluating any conditional
image generation tasks. VIESCORE leverages general knowledge from Multimodal
Large Language Models (MLLMs) as the backbone and does not require training or
fine-tuning. We evaluate VIESCORE on seven prominent conditional image generation
tasks and find: (1) VIESCORE (GPT-4v) achieves a high Spearman correlation of
0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)
VIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in
evaluating synthetic images. (3) VIESCORE achieves a correlation on par with
human ratings in the generation tasks but struggles in editing tasks. With
these results, we believe VIESCORE shows its great potential to replace human
judges in evaluating image synthesis tasks.
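For readers unfamiliar with the statistic quoted above, here is a minimal sketch of a Spearman rank correlation between metric scores and human ratings. The scores are made-up illustrative values, not data from the paper, and the helper names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values only.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.5]
human_scores = [5, 2, 3, 1, 4]
rho = spearman(metric_scores, human_scores)
print(round(rho, 3))
```

Because only ranks matter, the metric and the human ratings can live on different scales.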
☆ YAYI 2: Multilingual Open-Source Large Language Models
Yin Luo, Qingchao Kong, Nan Xu, Jia Cao, Bao Hao, Baoyu Qu, Bo Chen, Chao Zhu, Chenyang Zhao, Donglei Zhang, Fan Feng, Feifei Zhao, Hailong Sun, Hanxuan Yang, Haojun Pan, Hongyu Liu, Jianbin Guo, Jiangtao Du, Jingyi Wang, Junfeng Li, Lei Sun, Liduo Liu, Lifeng Dong, Lili Liu, Lin Wang, Liwen Zhang, Minzheng Wang, Pin Wang, Ping Yu, Qingxiao Li, Rui Yan, Rui Zou, Ruiqun Li, Taiwen Huang, Xiaodong Wang, Xiaofei Wu, Xin Peng, Xina Zhang, Xing Fang, Xinglin Xiao, Yanni Hao, Yao Dong, Yigang Wang, Ying Liu, Yongyu Jiang, Yungan Wang, Yuqi Wang, Zhangsheng Wang, Zhaoxin Yu, Zhen Luo, Wenji Mao, Lei Wang, Dajun Zeng
As the latest advancements in natural language processing, large language
models (LLMs) have achieved human-level language understanding and generation
abilities in many real-world tasks, and have even been regarded as a potential
path to artificial general intelligence. To better facilitate research on
LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been
proposed and have achieved performance comparable to proprietary models. However,
these models are primarily designed for English scenarios and exhibit poor
performance in Chinese contexts. In this technical report, we propose YAYI 2,
including both base and chat models, with 30 billion parameters. YAYI 2 is
pre-trained from scratch on a multilingual corpus which contains 2.65 trillion
tokens filtered by our pre-training data processing pipeline. The base model is
aligned with human values through supervised fine-tuning with millions of
instructions and reinforcement learning from human feedback. Extensive
experiments on multiple benchmarks, such as MMLU and CMMLU, consistently
demonstrate that the proposed YAYI 2 outperforms other similarly sized
open-source models.
☆ On the Use of Metaphor Translation in Psychiatry
Providing mental healthcare to individuals with limited English proficiency
(LEP) remains a pressing problem within psychiatry. Because the majority of
individuals trained in providing psychiatric care are English speakers, the
quality of mental healthcare given to LEP patients is significantly lower than
that provided for English speakers. The provision of mental healthcare is
contingent on communication and understanding between the patient and
healthcare provider, much more so than in the realm of physical healthcare, and
English speakers are often unable to comprehend figurative language such as
metaphors used by LEPs. Hence, Figurative Language Translation is invaluable to
providing equitable psychiatric care. Now, metaphor has been shown to be
paramount in both identifying individuals struggling with mental problems and
helping those individuals understand and communicate their experiences.
Therefore, this paper aims to survey the potential of Machine Translation for
providing equitable psychiatric healthcare and highlights the need for further
research on the transferability of existing machine and metaphor translation
research in the domain of psychiatry.
☆ Semantic Parsing for Complex Data Retrieval: Targeting Query Plans vs. SQL for No-Code Access to Relational Databases
Large Language Models (LLMs) have spurred progress in text-to-SQL, the task
of generating SQL queries from natural language questions based on a given
database schema. Despite the declarative nature of SQL, it continues to be a
complex programming language. In this paper, we investigate the potential of an
alternative query language with simpler syntax and modular specification of
complex queries. The purpose is to create a query language that can be learned
more easily by modern neural semantic parsing architectures while also enabling
non-programmers to better assess the validity of the query plans produced by an
interactive query plan assistant.
The proposed alternative query language is called Query Plan Language (QPL).
It is designed to be modular and can be translated into a restricted form of
SQL Common Table Expressions (CTEs). The aim of QPL is to make complex data
retrieval accessible to non-programmers by allowing users to express their
questions in natural language while also providing an easier-to-verify target
language. The paper demonstrates how neural LLMs can benefit from QPL's
modularity to generate complex query plans in a compositional manner. This
involves a question decomposition strategy and a planning stage.
We conduct experiments on a version of the Spider text-to-SQL dataset that
has been converted to QPL. The hierarchical structure of QPL programs enables
us to measure query complexity naturally. Based on this assessment, we identify
the low accuracy of existing text-to-SQL systems on complex compositional
queries. We present ways to address the challenge of complex queries in an
iterative, user-controlled manner, using fine-tuned LLMs and a variety of
prompting strategies in a compositional manner.
comment: arXiv admin note: text overlap with arXiv:2310.13575
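To make the idea of a modular query expressed as SQL Common Table Expressions concrete, the sketch below runs a query built from small named steps using Python's built-in sqlite3 module. This is not QPL syntax; the schema, data, and step names (scan_concert, scan_singer, joined) are hypothetical and only illustrate the CTE form that QPL is said to translate into.

```python
import sqlite3

# Hypothetical toy schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE concert (id INTEGER PRIMARY KEY, singer_id INTEGER, year INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 'France'), (2, 'Ben', 'Japan');
    INSERT INTO concert VALUES (1, 1, 2014), (2, 1, 2015), (3, 2, 2014);
""")

# Each CTE is one small, separately checkable step of the overall plan.
query = """
WITH
  scan_concert AS (SELECT singer_id FROM concert WHERE year = 2014),
  scan_singer AS (SELECT id, name FROM singer),
  joined AS (
    SELECT scan_singer.name AS name
    FROM scan_concert
    JOIN scan_singer ON scan_concert.singer_id = scan_singer.id
  )
SELECT name FROM joined ORDER BY name
"""
rows = conn.execute(query).fetchall()
print(rows)
```

A non-programmer can inspect each named step on its own, which is the kind of easier verification the abstract describes.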
☆ Large Language Model (LLM) Bias Index -- LLMBI
The Large Language Model Bias Index (LLMBI) is a pioneering approach designed
to quantify and address biases inherent in large language models (LLMs), such
as GPT-4. We recognise the increasing prevalence and impact of LLMs across
diverse sectors. This research introduces a novel metric, LLMBI, to
systematically measure and mitigate biases potentially skewing model responses.
We formulated LLMBI using a composite scoring system incorporating multiple
dimensions of bias, including but not limited to age, gender, and racial
biases.
To operationalise this metric, we engaged in a multi-step process involving
collecting and annotating LLM responses, applying sophisticated Natural
Language Processing (NLP) techniques for bias detection, and computing the
LLMBI score through a specially crafted mathematical formula. The formula
integrates weighted averages of various bias dimensions, a penalty for dataset
diversity deficiencies, and a correction for sentiment biases. Our empirical
analysis, conducted using responses from OpenAI's API, employs advanced
sentiment analysis as a representative method for bias detection.
The research reveals LLMs, whilst demonstrating impressive capabilities in
text generation, exhibit varying degrees of bias across different dimensions.
LLMBI provides a quantifiable measure to compare biases across models and over
time, offering a vital tool for systems engineers, researchers and regulators
in enhancing the fairness and reliability of LLMs. It highlights the potential
of LLMs in mimicking unbiased human-like responses. Additionally, it
underscores the necessity of continuously monitoring and recalibrating such
models to align with evolving societal norms and ethical standards.
☆ Computational Semantics and Evaluation Benchmark for Interrogative Sentences via Combinatory Categorial Grammar ACL
We present a compositional semantics for various types of polar questions and
wh-questions within the framework of Combinatory Categorial Grammar (CCG). To
assess the explanatory power of our proposed analysis, we introduce a
question-answering dataset QSEM specifically designed to evaluate the semantics
of interrogative sentences. We implement our analysis using existing CCG
parsers and conduct evaluations using the dataset. Through the evaluation, we
have obtained annotated data with CCG trees and semantic representations for
about half of the samples included in QSEM. Furthermore, we discuss the
discrepancy between the theoretical capacity of CCG and the capabilities of
existing CCG parsers.
comment: 11 pages, to appear in the Proceedings of PACLIC37
☆ Balancing the Style-Content Trade-Off in Sentiment Transfer Using Polarity-Aware Denoising
Text sentiment transfer aims to flip the sentiment polarity of a sentence
(positive to negative or vice versa) while preserving its sentiment-independent
content. Although current models show good results at changing the sentiment,
content preservation in transferred sentences is insufficient. In this paper,
we present a sentiment transfer model based on polarity-aware denoising, which
accurately controls the sentiment attributes in generated text, preserving the
content to a great extent and helping to balance the style-content trade-off.
Our proposed model is structured around two key stages in the sentiment
transfer process: better representation learning using a shared encoder and
sentiment-controlled generation using separate sentiment-specific decoders.
Empirical results show that our method outperforms state-of-the-art baselines
in terms of content preservation while staying competitive in terms of style
transfer accuracy and fluency.
comment: Published in 25th International Conference on Text, Speech and
Dialogue (TSD 2022)
☆ Collaborative Synthesis of Patient Records through Multi-Visit Health State Inference AAAI 2024
Electronic health records (EHRs) have become the foundation of machine
learning applications in healthcare, while the utility of real patient records
is often limited by privacy and security concerns. Synthetic EHR generation
provides an additional perspective to compensate for this limitation. Most
existing methods synthesize new records based on real EHR data, without
consideration of different types of events in EHR data, which cannot control
the event combinations in line with medical common sense. In this paper, we
propose MSIC, a Multi-visit health Status Inference model for Collaborative EHR
synthesis to address these limitations. First, we formulate the synthetic EHR
generation process as a probabilistic graphical model and tightly connect
different types of events by modeling the latent health states. Then, we derive
a health state inference method tailored for the multi-visit scenario to
effectively utilize previous records to synthesize current and future records.
Furthermore, we propose to generate medical reports to add textual descriptions
for each medical event, providing broader applications for synthesized EHR
data. For generating different paragraphs in each visit, we incorporate a
multi-generator deliberation framework to coordinate message passing among
multiple generators and employ a two-phase decoding strategy to generate
high-quality reports. Our extensive experiments on the widely used benchmarks,
MIMIC-III and MIMIC-IV, demonstrate that MSIC advances state-of-the-art results
on the quality of synthetic data while maintaining low privacy risks.
comment: Accepted at AAAI 2024
☆ BLSTM-Based Confidence Estimation for End-to-End Speech Recognition ICASSP 2021
Confidence estimation, in which we estimate the reliability of each
recognized token (e.g., word, sub-word, and character) in automatic speech
recognition (ASR) hypotheses and detect incorrectly recognized tokens, is an
important function for developing ASR applications. In this study, we perform
confidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR
systems show high performance (e.g., around 5% token error rates) for various
ASR tasks. In such situations, confidence estimation becomes difficult since we
need to detect infrequent incorrect tokens from mostly correct token sequences.
To tackle this imbalanced dataset problem, we employ a bidirectional long
short-term memory (BLSTM)-based model as a strong binary-class
(correct/incorrect) sequence labeler that is trained with a class balancing
objective. We experimentally confirmed that, by utilizing several types of ASR
decoding scores as its auxiliary features, the model steadily shows high
confidence estimation performance under highly imbalanced settings. We also
confirmed that the BLSTM-based model outperforms Transformer-based confidence
estimation models, which greatly underestimate incorrect tokens.
comment: Accepted to ICASSP 2021
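The abstract does not spell out its class balancing objective. One common choice, shown here purely as an assumption, is a binary cross-entropy whose per-class weights are inversely proportional to class frequency, so that the rare incorrect tokens contribute as much to the loss as the frequent correct ones. The function name and example values are hypothetical.

```python
import math


def balanced_bce(probs, labels):
    """Binary cross-entropy with inverse-frequency class weights.

    probs:  predicted probability that each token is correct (strictly in (0, 1)).
    labels: 1 for a correctly recognized token, 0 for an incorrect one.
    Assumed weighting, not confirmed by the abstract.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to balance them")
    w_pos = n / (2 * n_pos)  # weight for frequent "correct" tokens
    w_neg = n / (2 * n_neg)  # larger weight for rare "incorrect" tokens
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            total -= w_pos * math.log(p)
        else:
            total -= w_neg * math.log(1.0 - p)
    return total / n


# Illustrative values: three correct tokens and one incorrect token.
probs = [0.9, 0.8, 0.95, 0.3]
labels = [1, 1, 1, 0]
loss = balanced_bce(probs, labels)
print(loss)
```

With these weights each class carries the same total weight, regardless of how imbalanced the token labels are.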
☆ Reasons to Reject? Aligning Language Models with Judgments
As humans, we consistently engage in interactions with our peers and receive
feedback in the form of natural language. This language feedback allows us to
reflect on our actions, maintain appropriate behavior, and rectify our errors.
The question arises naturally: can we use language feedback to align large
language models (LLMs)? In contrast to previous research that aligns LLMs with
reward or preference data, we present the first systematic exploration of
alignment through the lens of language feedback (i.e., judgment). We commence
with an in-depth investigation of potential methods that can be adapted for
aligning LLMs with judgments, revealing that these methods are unable to fully
capitalize on the judgments. To facilitate more effective utilization of
judgments, we propose a novel framework, Contrastive Unlikelihood Training
(CUT), that allows for fine-grained inappropriate content detection and
correction based on judgments. Our offline alignment results show that, with
merely 1317 off-the-shelf judgment data, CUT (LLaMA2-13b) can beat the 175B
DaVinci003 and surpass the best baseline by 52.34 points on AlpacaEval. The
online alignment results demonstrate that CUT can align LLMs (LLaMA2-chat-13b)
in an iterative fashion using model-specific judgment data, with a steady
performance improvement from 81.09 to 91.36 points on AlpacaEval. Our analysis
further suggests that judgments exhibit greater potential than rewards for LLM
alignment and warrant future research.
comment: Our source codes and models are publicly available at
https://github.com/wwxu21/CUT
☆ SIG: Speaker Identification in Literature via Prompt-Based Generation AAAI 2024
Identifying speakers of quotations in narratives is an important task in
literary analysis, with challenging scenarios including the out-of-domain
inference for unseen speakers, and non-explicit cases where there are no
speaker mentions in surrounding context. In this work, we propose a simple and
effective approach SIG, a generation-based method that verbalizes the task and
quotation input based on designed prompt templates, which also enables easy
integration of other auxiliary tasks that further bolster the speaker
identification performance. The prediction can either come from direct
generation by the model, or be determined by the highest generation probability
of each speaker candidate. Based on our approach design, SIG supports
out-of-domain evaluation and achieves an open-world classification paradigm that
can accept any form of candidate input. We perform both cross-domain
evaluation and in-domain evaluation on PDNC, the largest dataset for this task,
where empirical results suggest that SIG outperforms previous baselines of
complicated designs, as well as zero-shot ChatGPT, excelling especially in the
hard non-explicit scenarios with up to 17% improvement. Additional
experiments on another dataset WP further corroborate the efficacy of SIG.
comment: Accepted to AAAI 2024
☆ Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning
Rongsheng Wang, Haoming Chen, Ruizhe Zhou, Yaofei Duan, Kunyan Cai, Han Ma, Jiaxi Cui, Jian Li, Patrick Cheong-Iao Pang, Yapeng Wang, Tao Tan
Existing research has demonstrated that refining large language models (LLMs)
through the utilization of machine-generated instruction-following data
empowers these models to exhibit impressive zero-shot capabilities for novel
tasks, without requiring human-authored instructions. In this paper, we
systematically investigate, preprocess, and integrate three Chinese
instruction-following datasets with the aim of enhancing the Chinese
conversational capabilities of the Mixtral-8x7B sparse Mixture-of-Experts model.
Through instruction fine-tuning on this carefully processed dataset, we
successfully construct a Mixtral-8x7B sparse Mixture-of-Experts model named
"Aurora." To assess the performance of Aurora, we utilize three widely
recognized benchmark tests: C-Eval, MMLU, and CMMLU. Empirical studies validate
the effectiveness of instruction fine-tuning applied to the Mixtral-8x7B sparse
Mixture-of-Experts model. This work is pioneering in the execution of
instruction fine-tuning on a sparse expert-mixed model, marking a significant
breakthrough in enhancing the capabilities of this model architecture. Our
code, data and model are publicly available at:
https://github.com/WangRongsheng/Aurora
comment: 10 pages, 2 figures
☆ Automatic Data Retrieval for Cross Lingual Summarization
Cross-lingual summarization involves the summarization of text written in one
language to a different one. There is a body of research addressing
cross-lingual summarization from English to other European languages. In this
work, we aim to perform cross-lingual summarization from English to Hindi. We
propose that pairing up the coverage of newsworthy events in textual and video
format can prove helpful for data acquisition for cross-lingual
summarization. We analyze the data and propose methods to match articles to
video descriptions that serve as document and summary pairs. We also outline
filtering methods over reasonable thresholds to ensure the correctness of the
summaries. Further, we make available 28,583 monolingual and cross-lingual
article-summary pairs at https://github.com/tingc9/Cross-Sum-News-Aligned. We also
build and analyze multiple baselines on the collected data and report error
analysis.
comment: 6 pages, 6 tables, 2 figures, conference: ICON 2023
☆ Theory of Hallucinations based on Equivariance
Equivariance is an important feature in machine learning, including language
models. It ensures that any sequences of phrases with the same meanings are
interpreted consistently. For example, the sentence 'There is a cat on the
table' should be interpreted by language models as it is, regardless of
variations in its token-level expression. Building on this insight, I propose a
new theory suggesting that insufficient equivariance in language models can
lead to hallucinations. According to this theory, which is both intuitive and
novel, language models trained on relatively small datasets tend to
misinterpret input texts and/or generate incorrect texts (i.e.,
hallucinations). To test this theory, I developed a toy model known as 'dancing
men', which is a character-level substitution cipher. Additionally, I propose a
novel technique based on the T5 (Text To Text Transfer Transformer) model to
efficiently decipher these codes without relying on frequency analysis. I have
found that this T5 model can almost completely solve the cipher, demonstrating
its ability to acquire equivariance in this frame. This method could be scaled
up to word-level and sentence-level substitution ciphers, analogous to large
language models without tokenizers or dictionaries. This scalability makes it
suitable for investigating the proposed link between inadequate equivariance
acquisition and the emergence of hallucinations.
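A character-level substitution cipher of the kind described above can be sketched in a few lines. The code only builds and inverts such a cipher with a known key; the paper's contribution is deciphering without the key using T5, which is not reproduced here. The function names and the seed are hypothetical.

```python
import random
import string


def make_key(seed):
    """Build a random one-to-one mapping over the lowercase letters."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_lowercase, shuffled))


def substitute(text, key):
    """Replace each mapped character; leave all other characters unchanged."""
    return "".join(key.get(ch, ch) for ch in text)


def invert(key):
    """Return the inverse mapping, which decodes what the key encodes."""
    return {v: k for k, v in key.items()}


key = make_key(seed=0)
plain = "there is a cat on the table"
cipher = substitute(plain, key)
recovered = substitute(cipher, invert(key))
print(cipher)
print(recovered)
```

Every key encodes the same sentence differently at the character level, so a model that deciphers arbitrary keys must treat all these surface forms as carrying one meaning.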
☆ Language Model is a Branch Predictor for Simultaneous Machine Translation ICASSP 2024
The primary objective of simultaneous machine translation (SiMT) is to
minimize latency while preserving the quality of the final translation. Drawing
inspiration from CPU branch prediction techniques, we propose incorporating
branch prediction techniques in SiMT tasks to reduce translation latency.
Specifically, we utilize a language model as a branch predictor to predict
potential branch directions, namely, future source words. Subsequently, we
utilize the predicted source words to decode the output in advance. When the
actual source word deviates from the predicted source word, we use the real
source word to decode the output again, replacing the predicted output. To
further reduce computational costs, we share the parameters of the encoder and
the branch predictor, and utilize a pre-trained language model for
initialization. Our proposed method can be seamlessly integrated with any SiMT
model. Extensive experimental results demonstrate that our approach can improve
translation quality and latency at the same time. Our code is available at
https://github.com/YinAoXiong/simt_branch_predictor .
comment: Accepted by IEEE ICASSP 2024
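The predict, decode in advance, and re-decode on a miss loop described above can be sketched as follows. This is a toy under stated assumptions: `predict_next` and `decode` are hypothetical stand-ins (a bigram lookup and an uppercasing function), not the authors' language model or SiMT decoder.

```python
def simultaneous_translate(source_stream, predict_next, decode):
    """Speculatively decode with a predicted next source word; redo on a miss."""
    prefix = []
    outputs = []
    redecodes = 0
    guess = predict_next(prefix)
    speculative = decode(prefix + [guess])  # computed before the word arrives
    for word in source_stream:
        if word == guess:
            output = speculative            # prediction hit: reuse early result
        else:
            output = decode(prefix + [word])  # miss: decode with the real word
            redecodes += 1
        prefix.append(word)
        outputs.append(output)
        guess = predict_next(prefix)
        speculative = decode(prefix + [guess])
    return outputs, redecodes


# Hypothetical stand-ins for the branch predictor and the translation model.
bigrams = {None: "the", "the": "cat", "cat": "sat"}


def predict_next(prefix):
    last = prefix[-1] if prefix else None
    return bigrams.get(last, "<unk>")


def decode(prefix):
    return prefix[-1].upper()


outputs, redecodes = simultaneous_translate(["the", "cat", "ran"], predict_next, decode)
print(outputs, redecodes)
```

As with CPU branch prediction, a correct guess hides the decoding latency, while a wrong guess costs one extra decode but never changes the final output.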
☆ MetaAID 2.5: A Secure Framework for Developing Metaverse Applications via Large Language Models
Large language models (LLMs) are increasingly being used in Metaverse
environments to generate dynamic and realistic content and to control the
behavior of non-player characters (NPCs). However, the cybersecurity concerns
associated with LLMs have become increasingly prominent. Previous research has
primarily focused on patching system vulnerabilities to enhance cybersecurity,
but these approaches are not well-suited to the Metaverse, where the virtual
space is more complex, LLMs are vulnerable, and ethical user interaction is
critical. Moreover, the scope of cybersecurity in the Metaverse is expected to
expand significantly. This paper proposes a method for enhancing cybersecurity
through the simulation of user interaction with LLMs. Our goal is to educate
users and strengthen their defense capabilities through exposure to a
comprehensive simulation system. This system includes extensive Metaverse
cybersecurity Q&A and attack simulation scenarios. By engaging with these,
users will improve their ability to recognize and withstand risks.
Additionally, to address the ethical implications of user input, we propose
using LLMs as evaluators to assess user content across five dimensions. We
further adapt the models through vocabulary expansion training to better
understand personalized inputs and emoticons. We conduct experiments on
multiple LLMs and find that our approach is effective.
☆ Efficacy of Machine-Generated Instructions
Large "instruction-tuned" language models (i.e., finetuned to respond to
instructions) have demonstrated a remarkable ability to generalize zero-shot to
new tasks. Nevertheless, they depend heavily on human-written instruction data
that is often limited in quantity, diversity, and creativity, therefore
hindering the generality of the tuned model. We conducted a quantitative study
to assess the efficacy of machine-generated annotations, where we compare
the results of a fine-tuned BERT model with human vs. machine-generated
annotations. Applying our methods to the vanilla GPT-3 model, we saw that
machine-generated annotations were 78.54% correct and the fine-tuned model
achieved 96.01% of the performance obtained with
human-labelled annotations. This result shows that machine-generated
annotations are a resource- and cost-effective way to fine-tune downstream
models.
comment: 8 pages, 2 pages references, 6 Tables, 8 Figures
☆ Don't Believe Everything You Read: Enhancing Summarization Interpretability through Automatic Identification of Hallucinations in Large Language Models
Priyesh Vakharia, Devavrat Joshi, Meenal Chavan, Dhananjay Sonawane, Bhrigu Garg, Parsa Mazaheri, Ian Lane
Large Language Models (LLMs) are adept at text manipulation -- tasks such as
machine translation and text summarization. However, these models can also be
prone to hallucination, which can be detrimental to the faithfulness of any
answers that the model provides. Recent works in combating hallucinations in
LLMs deal with identifying hallucinated sentences and categorizing the
different ways in which models hallucinate. This paper takes a deep dive into
LLM behavior with respect to hallucinations, defines a token-level approach to
identifying different kinds of hallucinations, and further utilizes this
token-level tagging to improve the interpretability and faithfulness of LLMs in
dialogue summarization tasks. Through this, the paper presents a new, enhanced
dataset and a new training paradigm.
comment: All authors contributed equally to this work
☆ Logic-Scaffolding: Personalized Aspect-Instructed Recommendation Explanation Generation using LLMs WSDM 2024
The unique capabilities of Large Language Models (LLMs), such as the natural
language text generation ability, position them as strong candidates for
providing explanation for recommendations. However, despite the size of the
LLM, most existing models struggle to produce zero-shot explanations reliably.
To address this issue, we propose a framework called Logic-Scaffolding, that
combines the ideas of aspect-based explanation and chain-of-thought prompting
to generate explanations through intermediate reasoning steps. In this paper,
we share our experience in building the framework and present an interactive
demonstration for exploring our results.
comment: The 17th ACM International Conference on Web Search and Data Mining
(WSDM 2024)
♻ ☆ Next Steps for Human-Centered Generative AI: A Technical Perspective
Xiang 'Anthony' Chen, Jeff Burke, Ruofei Du, Matthew K. Hong, Jennifer Jacobs, Philippe Laban, Dingzeyu Li, Nanyun Peng, Karl D. D. Willis, Chien-Sheng Wu, Bolei Zhou
Through iterative, cross-disciplinary discussions, we define and propose
next-steps for Human-centered Generative AI (HGAI). We contribute a
comprehensive research agenda that lays out future directions of Generative AI
spanning three levels: aligning with human values; assimilating human intents;
and augmenting human abilities. By identifying these next-steps, we intend to
draw interdisciplinary research teams to pursue a coherent set of emergent
ideas in HGAI, focusing on their topics of interest while maintaining a coherent
big picture of the future work landscape.
♻ ☆ Are Structural Concepts Universal in Transformer Language Models? Towards Interpretable Cross-Lingual Generalization EMNLP 2023
Large language models (LLMs) have exhibited considerable cross-lingual
generalization abilities, whereby they implicitly transfer knowledge across
languages. However, the transfer is not equally successful for all languages,
especially for low-resource ones, which poses an ongoing challenge. It is
unclear whether we have reached the limits of implicit cross-lingual
generalization and if explicit knowledge transfer is viable. In this paper, we
investigate the potential for explicitly aligning conceptual correspondence
between languages to enhance cross-lingual generalization. Using the syntactic
aspect of language as a testbed, our analyses of 43 languages reveal a high
degree of alignability among the spaces of structural concepts within each
language for both encoder-only and decoder-only LLMs. We then propose a
meta-learning-based method to learn to align conceptual spaces of different
languages, which facilitates zero-shot and few-shot generalization in concept
classification and also offers insights into the cross-lingual in-context
learning phenomenon. Experiments on syntactic analysis tasks show that our
approach achieves competitive results with state-of-the-art methods and narrows
the performance gap between languages, particularly benefiting those with
limited resources.
comment: Findings of EMNLP 2023 (Camera-Ready)
♻ ☆ Unsupervised Melody-to-Lyric Generation ACL 2023
Yufei Tian, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Gunnar Sigurdsson, Chenyang Tao, Wenbo Zhao, Yiwen Chen, Tagyoung Chung, Jing Huang, Nanyun Peng
Automatic melody-to-lyric generation is a task in which song lyrics are
generated to go with a given melody. It is of significant practical interest
and more challenging than unconstrained lyric generation as the music imposes
additional constraints onto the lyrics. The training data is limited as most
songs are copyrighted, resulting in models that underfit the complicated
cross-modal relationship between melody and lyrics. In this work, we propose a
method for generating high-quality lyrics without training on any aligned
melody-lyric data. Specifically, we design a hierarchical lyric generation
framework that first generates a song outline and then the complete lyrics.
The framework enables disentanglement of training (based purely on text) from
inference (melody-guided text generation) to circumvent the shortage of
parallel data.
We leverage the segmentation and rhythm alignment between melody and lyrics
to compile the given melody into decoding constraints as guidance during
inference. The two-step hierarchical design also enables content control via
the lyric outline, a much-desired feature for democratizing collaborative song
creation. Experimental results show that our model can generate high-quality
lyrics that are more on-topic, singable, intelligible, and coherent than strong
baselines, for example SongMASS, a SOTA model trained on a parallel dataset,
with a 24% relative overall quality improvement based on human ratings.
comment: ACL 2023. arXiv admin note: substantial text overlap with
arXiv:2305.07760
♻ ☆ How Far Have We Gone in Vulnerability Detection Using Large Language Models
As software becomes increasingly complex and prone to vulnerabilities,
automated vulnerability detection is critically important, yet challenging.
Given the significant successes of large language models (LLMs) in various
tasks, there is growing anticipation of their efficacy in vulnerability
detection. However, a quantitative understanding of their potential in
vulnerability detection is still missing. To bridge this gap, we introduce a
comprehensive vulnerability benchmark VulBench. This benchmark aggregates
high-quality data from a wide range of CTF (Capture-the-Flag) challenges and
real-world applications, with annotations for each vulnerable function
detailing the vulnerability type and its root cause. Through our experiments
encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models
and static analyzers, we find that several LLMs outperform traditional deep
learning approaches in vulnerability detection, revealing an untapped potential
in LLMs. This work contributes to the understanding and utilization of LLMs for
enhanced software security.
♻ ☆ In-Context Probing: Toward Building Robust Classifiers via Probing Large Language Models
Large language models are able to learn new tasks in context, where they are
provided with instructions and a few annotated examples. However, the
effectiveness of in-context learning is dependent on the provided context, and
the performance on a downstream task can vary considerably, depending on the
instruction. Importantly, such dependency on the context can surface in
unpredictable ways, e.g., a seemingly more informative instruction might lead
to a worse performance. In this paper, we propose an alternative approach,
which we term In-Context Probing (ICP). Similar to in-context learning, we
contextualize the representation of the input with an instruction, but instead
of decoding the output prediction, we probe the contextualized representation
to predict the label. Through a series of experiments on a diverse set of
classification tasks, we show that in-context probing is significantly more
robust to changes in instructions. We further show that ICP performs
competitively with or better than finetuning and can be particularly helpful
for building classifiers on top of smaller models, with fewer than a hundred training
examples.
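The probing step described above can be illustrated with a minimal sketch. It assumes the contextualized representations have already been extracted from a frozen model; here they are replaced by synthetic vectors, and the probe is a plain logistic regression. The function names, dimensions, and training settings are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_linear_probe(reps, labels, lr=0.5, steps=500):
    """Fit a binary logistic-regression probe on frozen representations."""
    n, d = reps.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(reps @ w + b)))
        grad = probs - labels          # gradient of the log loss w.r.t. logits
        w -= lr * (reps.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_predict(reps, w, b):
    """Predict labels directly from representations, without decoding text."""
    return (reps @ w + b > 0).astype(int)

# Assumption: synthetic stand-ins for instruction-contextualized representations.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=80)
centers = np.where(labels[:, None] == 1, 1.5, -1.5)
reps = centers + rng.normal(size=(80, 16))

# Fewer than a hundred training examples, echoing the low-data setting.
w, b = fit_linear_probe(reps[:60], labels[:60])
accuracy = (probe_predict(reps[60:], w, b) == labels[60:]).mean()
```

In a real setup, `reps` would be hidden states read from the language model after it processes the instruction and input; only the small probe is trained.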
♻ ☆ Aligning Language Models with Human Preferences via a Bayesian Approach NeurIPS 2023
In the quest to advance human-centric natural language generation (NLG)
systems, ensuring alignment between NLG models and human preferences is
crucial. For this alignment, current popular methods leverage a reinforcement
learning (RL) approach with a reward model trained on feedback from humans.
However, inherent disagreements due to the subjective nature of human
preferences pose a significant challenge for training the reward model,
resulting in a deterioration of the NLG performance. To tackle this issue,
previous approaches typically rely on majority voting or averaging to
consolidate multiple inconsistent preferences into a merged one. Although
straightforward to understand and execute, such methods suffer from an
inability to capture the nuanced degrees of disagreement among humans and may
only represent a specialized subset of individuals, thereby lacking the ability
to quantitatively disclose the universality of human preferences. To address
this challenge, this paper proposes a novel approach, which employs a Bayesian
framework to account for the distribution of disagreements among human
preferences when training a preference model, named d-PM. In addition,
because the RL strategy's training process is inefficient and complex, we
further propose utilizing a contrastive learning
strategy to train the NLG model with the preference scores derived from the
d-PM model. Extensive experiments on two human-centric NLG tasks, i.e.,
emotional support conversation and integrity "Rule-of-Thumb" generation, show
that our method consistently exceeds previous SOTA models in both automatic and
human evaluations.
comment: NeurIPS 2023
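The contrast between majority voting and modeling the distribution of disagreement can be sketched as follows. The abstract does not specify d-PM's formulation, so this example swaps in a simple Beta-Bernoulli model as a stand-in; the function names and the uniform prior are assumptions made purely for illustration.

```python
def majority_vote(votes):
    """Collapse binary votes (1 = annotator prefers response A) into a hard label."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def beta_posterior_mean(votes, prior_a=1.0, prior_b=1.0):
    """Posterior mean of P(prefers A) under an assumed Beta(prior_a, prior_b) prior."""
    k = sum(votes)
    n = len(votes)
    return (k + prior_a) / (n + prior_a + prior_b)

unanimous = [1, 1, 1, 1, 1]   # all five annotators prefer A
split = [1, 1, 1, 0, 0]       # three of five prefer A

# Majority voting assigns both cases the same hard label...
hard_labels = (majority_vote(unanimous), majority_vote(split))

# ...while the posterior mean keeps the degree of agreement as a soft score.
soft_scores = (beta_posterior_mean(unanimous), beta_posterior_mean(split))
```

Both vote sets receive the same majority label, but their soft scores differ (6/7 versus 4/7), which is the kind of information a hard label discards.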
♻ ☆ Text normalization for low-resource languages: the case of Ligurian
Text normalization is a crucial technology for low-resource languages that
lack rigid spelling conventions or have undergone multiple spelling
reforms. Low-resource text normalization has so far relied upon hand-crafted
rules, which are perceived to be more data efficient than neural methods. In
this paper we examine the case of text normalization for Ligurian, an
endangered Romance language. We collect 4,394 Ligurian sentences paired with
their normalized versions, as well as the first open source monolingual corpus
for Ligurian. We show that, in spite of the small amounts of data available, a
compact transformer-based model can be trained to achieve very low error rates
by the use of backtranslation and appropriate tokenization.
♻ ☆ Prompt-Based Editing for Text Style Transfer EMNLP
Prompting approaches have been recently explored in text style transfer,
where a textual prompt is used to query a pretrained language model to generate
style-transferred texts word by word in an autoregressive manner. However, such
a generation process is less controllable and early prediction errors may
affect future word predictions. In this paper, we present a prompt-based
editing approach for text style transfer. Specifically, we prompt a pretrained
language model for style classification and use the classification probability
to compute a style score. Then, we perform discrete search with word-level
editing to maximize a comprehensive scoring function for the style-transfer
task. In this way, we transform a prompt-based generation problem into a
classification one, which is a training-free process and more controllable than
the autoregressive generation of sentences. In our experiments, we perform
both automatic and human evaluation on three style-transfer benchmark datasets,
and show that our approach largely outperforms the state-of-the-art systems
that have 20 times more parameters. Additional empirical analyses further
demonstrate the effectiveness of our approach.
comment: Accepted by EMNLP Findings 2023
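The idea of discrete search with word-level edits can be sketched as a greedy hill climb. This is a toy illustration, not the paper's system: the style score here is a hypothetical lexicon count standing in for a prompted language model's classification probability, the replacement table is invented, and only the style term of the scoring function is modeled.

```python
# Assumption: tiny hand-made lexicons instead of a prompted classifier.
POSITIVE = {"great", "delicious", "friendly"}
NEGATIVE = {"terrible", "bland", "rude"}
# Hypothetical word-level replacement candidates.
REPLACEMENTS = {"terrible": ["great"], "bland": ["delicious"], "rude": ["friendly"]}

def style_score(words):
    """Smoothed share of positive style words (stand-in for a classifier probability)."""
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos + 1) / (pos + neg + 2)

def edit_search(words, max_steps=10):
    """Greedily apply the single replacement edit that most improves the score."""
    current = list(words)
    for _ in range(max_steps):
        best, best_score = None, style_score(current)
        for i, word in enumerate(current):
            for candidate in REPLACEMENTS.get(word, []):
                trial = current[:i] + [candidate] + current[i + 1:]
                score = style_score(trial)
                if score > best_score:
                    best, best_score = trial, score
        if best is None:   # no edit improves the score, so stop
            break
        current = best
    return current

result = edit_search("the food was bland and the staff were rude".split())
```

Because each step only scores candidate sentences, the procedure needs no training, and the number and kind of edits can be controlled directly.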
♻ ☆ Is ChatGPT A Good Keyphrase Generator? A Preliminary Study
Mingyang Song, Haiyun Jiang, Shuming Shi, Songfang Yao, Shilong Lu, Yi Feng, Huafeng Liu, Liping Jing
The emergence of ChatGPT has recently garnered significant attention from the
computational linguistics community. To demonstrate its capabilities as a
keyphrase generator, we conduct a preliminary evaluation of ChatGPT for the
keyphrase generation task. We evaluate its performance in various aspects,
including keyphrase generation prompts, keyphrase generation diversity, and
long document understanding. Our evaluation is based on six benchmark datasets,
and we adopt the prompt suggested by OpenAI while extending it to six candidate
prompts. We find that ChatGPT performs exceptionally well on all six candidate
prompts, with minor performance differences observed across the datasets. Based
on our findings, we conclude that ChatGPT has great potential for keyphrase
generation. However, we find that ChatGPT still faces challenges when it
comes to generating absent keyphrases. Meanwhile, in the final section, we also
present some limitations and future expansions of this report.
comment: Technical Report, 6 pages
♻ ☆ Guiding Language Model Reasoning with Planning Tokens
Large language models (LLMs) have recently attracted considerable interest
for their ability to perform complex reasoning tasks, such as chain-of-thought
reasoning. However, most of the existing approaches to enhance this ability
rely heavily on data-driven methods, while neglecting the structural aspects of
the model's reasoning capacity. We find that while LLMs can manage individual
reasoning steps well, they struggle with maintaining consistency across an
entire reasoning chain. To solve this, we introduce 'planning tokens' at the
start of each reasoning step, serving as a guide for the model. These token
embeddings are then fine-tuned along with the rest of the model parameters. Our
approach requires a negligible increase in trainable parameters (just 0.001%)
and can be applied through either full fine-tuning or a more
parameter-efficient scheme. We demonstrate our method's effectiveness by
applying it to three different LLMs, showing notable accuracy improvements
across three math word problem datasets w.r.t. plain chain-of-thought
fine-tuning baselines.
comment: 10 pages, 4 figures
♻ ☆ Developing Interactive Tourism Planning: A Dialogue Robot System Powered by a Large Language Model
In recent years, large language models (LLMs) have rapidly proliferated and
have been utilized in various tasks, including research in dialogue systems. We
aimed to construct a system that leverages not only the flexible conversational
abilities of LLMs but also their advanced planning capabilities to reduce the
speaking load on human interlocutors and efficiently plan trips. Furthermore,
we propose a method that divides the complex task of a travel agency into
multiple subtasks, managing each as a separate phase to effectively accomplish
the task. Our proposed system achieved a certain level of success, placing
fourth in the preliminary round of the Dialogue Robot Competition 2023. We
report on the challenges identified through the competition.
comment: This paper is part of the proceedings of the Dialogue Robot
Competition 2023
♻ ☆ NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional, and Explainable Reasoning
Our goal is a modern approach to answering questions via systematic reasoning
where answers are supported by human-interpretable proof trees grounded in an
NL corpus of authoritative facts. Such a system would help alleviate the
challenges of interpretability and hallucination with modern LMs, and the lack
of grounding of current explanation methods (e.g., Chain-of-Thought). This
paper proposes a new take on Prolog-based inference engines, where we replace
handcrafted rules with a combination of neural language modeling, guided
generation, and semiparametric dense retrieval. Our implementation, NELLIE, is
the first system to demonstrate fully interpretable, end-to-end grounded QA as
entailment tree proof search, going beyond earlier work explaining
known-to-be-true facts from text. In experiments, NELLIE outperforms a
similar-sized state-of-the-art reasoner [Tafjord et al., 2022] while producing
knowledge-grounded explanations. We also find NELLIE can exploit both
semi-structured and NL text corpora to guide reasoning. Together these suggest
a new way to jointly reap the benefits of both modern neural methods and
traditional symbolic reasoning.